Business Analytics

Advanced Data Visualizations

Ayush Patel and Jayati Sharma

16 March, 2024

Pre-requisite

You already:

  • Know basic and advanced data wrangling functions in R
  • Know basics of data visualization in R
  • Can write functions in R

Before we begin

Please install and load the following packages

library(dplyr)
library(tidyverse)
library(lubridate) # make_date() and ymd(); attached by tidyverse only from version 2.0.0
library(scales)
library(patchwork)
library(gghighlight)
library(ggiraph)
library(ISLR2)
library(openintro)
library(janitor)
library(gapminder)



Access the lecture slides from the course landing page

About me

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at Gokhale Institute of Politics and Economics

I am an RStudio (Posit) certified tidyverse instructor.

I am a researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objectives

  • Learn annotation for graphs in R
  • Learn how to combine graphs
  • Learn scaling functions in R
  • Learn how to make ggplot graphs interactive

Let’s Recap

  • In the data visualization lecture, you learnt how to create various types of graphs using ggplot2
  • Some of them include bar graphs, line graphs, scatter plots, etc.
  • For effective data visualization and communication, any plot requires modifications
  • These include annotations on the plot, modification of axes and scales, highlighting and interactivity of the plot
  • The aim of this lecture is to move beyond making graphs, towards clear and effective visualizations

Annotations in ggplot - Text

Content for this topic has been sourced from ggplot2. Please check out the work for detailed information.

  • In addition to plotting your graph, you want to provide additional details to explain your graph
  • Text annotations are useful in this case
  • The annotate() function can add any kind of geom to a plot as an annotation
  • In the annotate() function, the type of geom is specified first
  • Then, the position is given (x and y coordinates in this case)
  • This is followed by the label
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("text", x = 4, y = 25, label = "Annotation Text")

Annotations in ggplot

Content for this topic has been sourced from ggplot2. Please check out the work for detailed information.

  • Further, annotations can be customized
 ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("text", x = 4, y = 25, label = "Annotation Text", colour = "orange", size = 8)

 ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("text", x = 1:5, y = 6, label = "Annotation Text", colour = "orange", size = 3)

Annotations in ggplot

Content for this topic has been sourced from ggplot2. Please check out the work for detailed information.

  • Similar to text annotation, other geoms can be used for annotations
  • However, instead of x and y, a rectangle takes xmin, xmax, ymin and ymax as its coordinates
  • Do you remember what alpha is used for?
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("rect", xmin = 4.8, 
            xmax = 5.7,
            ymin = 10,
            ymax = 18.6, 
            alpha = .2)

Annotations in ggplot

Content for this topic has been sourced from ggplot2. Please check out the work for detailed information.

  • Suppose you want to add a line segment to your graph
  • Here, annotate() requires the start coordinates (x, y) and the end coordinates (xend, yend)
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("segment", x = 4.8,
            xend = 5.7,
            y = 10,
            yend = 18.6,
            colour = "red")
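Text and segment annotations can also be combined to point at a specific observation. The following is a minimal sketch, not part of the original slides: it assumes you want to mark the heaviest car in mtcars, the label offsets are chosen by eye, and the object names heaviest and p_arrow are made up for illustration.

```r
library(ggplot2)

# Look up the heaviest car so the coordinates come from the data
heaviest <- mtcars[which.max(mtcars$wt), ]

p_arrow <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Label placed above and to the left of the point (offsets are arbitrary)
  annotate("text", x = heaviest$wt - 1, y = heaviest$mpg + 8,
           label = "Heaviest car") +
  # Segment running from just below the label to the point, with an arrow head
  annotate("segment",
           x = heaviest$wt - 1, xend = heaviest$wt,
           y = heaviest$mpg + 7, yend = heaviest$mpg,
           colour = "red",
           arrow = arrow(length = unit(0.3, "cm")))

p_arrow
```

The arrow() and unit() helpers come from the grid package and are re-exported by ggplot2, so no extra library call is needed.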

Do it Yourself - 1

  • Load the Auto data from ISLR2 package
  • Make a scatterplot of horsepower and acceleration
  • To the plot, add the text Horsepower vs. Acceleration
  • Add a rectangle to the plot, such that it covers the area where horsepower is higher than 200 but acceleration is still less than 15
  • Add a line to the plot from the coordinates (50,10) to (150,20)

Scales Functions in ggplot2 - Why?

  • When you create a graph, using ggplot2, the axes are mapped automatically based on the data
  • However, you would often need to change the axes in order to effectively present the data
  • The scale functions in ggplot2:
    • control how the data is plotted
    • allow manipulation of axes
    • improve the overall appearance of the plot for effective data communication

Scales Functions in ggplot2

  • Look at the scatter plot of wt and mpg
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()

  • What if you want the axes to start from 0?
  • scale_y_continuous() allows you to set the range for the y-axis; scale_x_continuous() does the same for the x-axis
  • The limits argument of scale_y_continuous() sets the lower and upper limits of the scale
  • Here, NA tells ggplot2 to compute the upper limit from the data
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_continuous(limits = c(0, NA))
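To start both axes from 0, the same limits can be applied to the x-axis with scale_x_continuous(). A minimal sketch (the name p_zero is only for illustration):

```r
library(ggplot2)

p_zero <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Lower limit fixed at 0; NA leaves the upper limit to be computed from the data
  scale_x_continuous(limits = c(0, NA)) +
  scale_y_continuous(limits = c(0, NA))

p_zero
```

An alternative with a similar effect is expand_limits(x = 0, y = 0), which widens the axes to include 0 without fixing the limits explicitly.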

Scales Functions in ggplot2

  • Instead of using NA, you can provide an explicit upper limit, such as 40, for the y-axis
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_continuous(limits = c(0, 40))

Scales Functions in ggplot2

  • Setting breaks in scale_y_continuous() lets you choose the values at which the axis shows tick marks and labels
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_continuous(breaks = seq(0, 40, 7))
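It can help to look at what seq() returns before passing it to breaks. The sketch below is an illustration rather than part of the original slides; it also sets limits, on the assumption that you want every break to be visible, since breaks outside the plotted range are not drawn.

```r
library(ggplot2)

# The break positions: start at 0, step by 7, stop before exceeding 40
y_breaks <- seq(0, 40, by = 7)
y_breaks
# 0  7 14 21 28 35

p_breaks <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Limits widened so that all six breaks fall inside the axis range
  scale_y_continuous(breaks = y_breaks, limits = c(0, 40))

p_breaks
```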

Scales Functions in ggplot2

  • Similarly, there are other transformations for scale - reversing the scale
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_reverse()

  • scale_y_log10() applies a log10 transformation to the y-axis scale
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_log10()

Scales Functions in ggplot2

Content for this topic has been sourced from ggplot2. Please check out the work for detailed information.

  • scale_colour_brewer() is useful for colouring discrete values on your graph
  • The brewer scales provide sequential, diverging and qualitative colour schemes from ColorBrewer; a sequential scheme is used by default
  • Look at the two charts
  • scale_colour_brewer() helps in efficient mapping of discrete variables
ggplot(mpg, aes(x = displ, y = cty)) +
   geom_point(aes(colour = class))

ggplot(mpg, aes(x = displ, y = cty)) +
   geom_point(aes(colour = class))+
  scale_colour_brewer()
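Because class is an unordered category, a qualitative palette may separate the groups more clearly than the default sequential one. A minimal sketch, assuming the ColorBrewer palette "Dark2" (the choice of palette is only an example):

```r
library(ggplot2)

p_brewer <- ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(aes(colour = class)) +
  # palette picks a named ColorBrewer scheme; "Dark2" is a qualitative one
  scale_colour_brewer(palette = "Dark2")

p_brewer
```

Running RColorBrewer::display.brewer.all() shows the available palettes, if the RColorBrewer package is installed.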

Do it Yourself - 2

  • Using the Auto data, plot a scatterplot between y = weight and x = displacement
  • Set the x-axis breaks at intervals of 50
  • Set the y-axis breaks at intervals of 5
  • What variable according to you can be used as the colour for the points? How?

Scales Package

  • The scales package provides many scaling functions for visualizations
  • It allows for sophisticated customisation of how data is shown in a plot
  • It includes functions for readable and informative axes

Scales

Content for this topic has been sourced from scales. Please check out the work for detailed information.

  • Look at the following chart made using txhousing data
  • We want to make it more readable and clear
  • The zeroes on the y-axis labels can be shortened, and the years on the x-axis can be shown more compactly
txhousing %>% 
  mutate(date = make_date(year, month, 1)) %>% 
  group_by(city) %>% 
  filter(min(sales) > 500) %>% 
  ggplot(aes(date, sales, group = city)) + 
  geom_line(na.rm = TRUE)

Scales

Content for this topic has been sourced from scales. Please check out the work for detailed information.

  • Similar to the scale functions in ggplot2, the scales package has functions for breaks and labels
  • The breaks_width() function places a break every two years on the axis, while label_date() with %y shows only the last two digits of the year, making the axis less crowded
  • On the y-axis, cut_short_scale() (passed to label_number() through scale_cut) abbreviates large numbers, replacing trailing zeroes with a suffix such as K
txhousing %>% 
  mutate(date = make_date(year, month, 1)) %>% 
  group_by(city) %>% 
  filter(min(sales) > 500) %>% 
  ggplot(aes(date, sales, group = city)) + 
  geom_line(na.rm = TRUE) + 
  scale_x_date(
    NULL,
    breaks = scales::breaks_width("2 years"), 
    labels = scales::label_date("'%y")) + 
  scale_y_log10(
    "Total sales",
    labels = scales::label_number(scale_cut = scales::cut_short_scale()))
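The label_*() functions return functions, so you can try them on a few values before using them in a plot. A small sketch, assuming scales 1.2.0 or later (needed for scale_cut); the names short_number and two_digit_year are made up for illustration:

```r
library(scales)

# A labeller that abbreviates thousands, millions, ... with short suffixes
short_number <- label_number(scale_cut = cut_short_scale())
short_number(c(500, 1500, 25000))

# A labeller that prints an apostrophe followed by the two-digit year
two_digit_year <- label_date("'%y")
two_digit_year(as.Date(c("2000-01-01", "2015-06-01")))
```

Calling a labeller on a handful of values like this is a quick way to check the format before it reaches the axis.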

Scales

Content for this topic has been sourced from scales. Please check out the work for detailed information.

  • Let us try modifying another graph using economics data
economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line()

  • How can this be made more readable?
  • For the x-axis, one option is to show breaks every few months rather than only at years, for better insights
  • For the y-axis, a label that adds the dollar sign would make the chart more readable

Scales

Content for this topic has been sourced from scales. Please check out the work for detailed information.

  • breaks_width() sets a break every 3 months
  • However, you might want the axis labels to show the months without repeating the year each time
  • label_date_short() shortens the date labels by showing only the parts of the date that have changed since the previous label
economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short())

Scales

Content for this topic has been sourced from scales. Please check out the work for detailed information.

  • For the y-axis, breaks_extended() lets you ask for a desired number of evenly spaced, rounded breaks (here, about 8)
  • label_dollar() adds a dollar sign to the y-axis
economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short()) + 
  scale_y_continuous("Personal consumption expenditures",
    breaks = scales::breaks_extended(8),
    labels = scales::label_dollar())
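In the economics data, pce is recorded in billions of dollars, so a suffix can make the unit explicit. The sketch below is one possible variation, not the slides' own code; it uses the suffix argument of label_dollar() and base R subsetting to stay self-contained, and the names billions, early and p_billions are illustrative.

```r
library(ggplot2)
library(scales)

# A dollar labeller that appends "B" for billions
billions <- label_dollar(suffix = "B")
billions(c(500, 1000, 1500))

# Same date filter as above, written with base R
early <- economics[economics$date < as.Date("1970-01-01"), ]

p_billions <- ggplot(early, aes(date, pce)) +
  geom_line() +
  scale_y_continuous("Personal consumption expenditures",
                     labels = billions)

p_billions
```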

Do it Yourself - 3

  • Load the tourism data from openintro
  • Make a line graph of year and tourist spending
  • Is there any change you could make to the chart for better readability?

Patchwork

  • You have made multiple plots by now and want to combine them into the same graphic
  • A very easy way to do this is by using patchwork
  • Let us learn this using our recently made plots
p1 <- economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line()

p2 <- economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short())

p3 <- economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short()) + 
  scale_y_continuous("Personal consumption expenditures",
    breaks = scales::breaks_extended(8),
    labels = scales::label_dollar())

Patchwork

  • The usage of patchwork is very simple: you literally just add plots together!
p1 + p2

  • You can also put the plots one below the other
p1 / p2

Patchwork

  • While plots p1 and p2 show the intermediate steps, p3 is the final plot
  • It would be better to have the two at the top and the final one at the bottom
(p1 + p2) / p3

Patchwork

  • After combining all the plots, you would want to modify all plots at once
patchwork <- (p1 + p2) / p3
patchwork & theme_minimal()
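A combined graphic often needs a shared title and panel labels as well. The following sketch uses plot_annotation() from patchwork for that; the three mtcars plots and their names are stand-ins chosen only to keep the example self-contained.

```r
library(ggplot2)
library(patchwork)

# Three simple stand-in plots
plot_a <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
plot_b <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
plot_c <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

# Two plots on top, one below, with an overall title and A/B/C tags
combined <- (plot_a + plot_b) / plot_c +
  plot_annotation(title = "Fuel economy in mtcars", tag_levels = "A")

# & applies the theme to every panel of the combined plot
combined & theme_minimal()
```

Here + adds plot_annotation() to the whole layout, while & applies theme_minimal() to each individual plot.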

Do it Yourself - 4

  • From the tourism data, make line charts of year and visitor_count_tho, one for each decade
  • Combine these charts in such a way that at the top, the graph for all years is displayed and below it, there are 5 charts, one for each decade

Highlight information - gghighlight()

Content for this topic has been sourced from Hiroaki Yutani’s work. Please check out the work for detailed information.

  • Run the following code to generate a random dataset
set.seed(2)
data <- purrr::map_dfr(letters, ~ data.frame(
      id = 1:500,
      value = cumsum(runif(500, -5, 5)),
      type = .,
      flag = sample(c(TRUE, FALSE), size = 500, replace = TRUE),
      stringsAsFactors = FALSE))
  • Suppose you want to plot value against id, with one line for each type
ggplot(data) +
  geom_line(aes(x= id, y = value, colour = type))

Highlight information - gghighlight()

Content for this topic has been sourced from Hiroaki Yutani’s work. Please check out the work for detailed information.

  • Too much clutter, right?
  • You can filter the data to keep only the lines with the largest values
data_filtered <- data %>%
  group_by(type) %>% 
  filter(max(value) > 112) %>%
  ungroup()

ggplot(data_filtered) +
  geom_line(aes(id, value, colour = type))

Highlight information - gghighlight()

Content for this topic has been sourced from Hiroaki Yutani’s work. Please check out the work for detailed information.

  • However, that takes away the context of our visualization
  • It is also tedious
  • gghighlight() comes in handy here
  • Think of it as the filter() equivalent for ggplot2
ggplot(data) +
  geom_line(aes(id, value, colour = type)) +
  gghighlight(max(value) > 112)

  • You can customise plots as well
ggplot(data) +
  geom_line(aes(id, value, colour = type)) +
  gghighlight(max(value) > 112)+
  theme_minimal()
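gghighlight() also has arguments that control the look of the result. The sketch below is an illustration under stated assumptions: it uses a small stand-in dataset of five random walks (not the data from the slides), turns off the direct labels with use_direct_label so that a legend is shown instead, and sets the colour of the faded lines through unhighlighted_params.

```r
library(ggplot2)
library(gghighlight)

set.seed(2)
# Stand-in data: five random walks of 100 steps each
walks <- do.call(rbind, lapply(letters[1:5], function(type) {
  data.frame(id = 1:100,
             value = cumsum(runif(100, -5, 5)),
             type = type)
}))

p_highlight <- ggplot(walks) +
  geom_line(aes(id, value, colour = type)) +
  # Highlight the two lines with the largest maximum value
  gghighlight(max(value), max_highlight = 2L,
              use_direct_label = FALSE,
              unhighlighted_params = list(colour = "grey85")) +
  theme_minimal()

p_highlight
```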

Highlight information - gghighlight()

Content for this topic has been sourced from Hiroaki Yutani’s work. Please check out the work for detailed information.

  • Similarly, you can highlight other types of geoms as well
ggplot(mpg)+
  geom_point(aes(displ, cty))+
  gghighlight(displ >= 5)

  • You can use more than one condition for gghighlight
ggplot(mpg)+
  geom_point(aes(displ, cty))+
  gghighlight(displ >= 5, cty >= 15)

Highlight information - gghighlight()

Content for this topic has been sourced from Hiroaki Yutani’s work. Please check out the work for detailed information.

  • You can limit the number of groups that are highlighted using max_highlight
ggplot(data) +
  geom_line(aes(id, value, colour = type)) +
  gghighlight(max(value), max_highlight =  5L)

Do it Yourself - 5

  • Use txhousing data from ggplot2 and make a new dataset called txhousing_sales which calculates the total sales of each city for each year
  • Now, make a line chart of total sales over the years for all cities. Highlight the top 4 cities
  • From txhousing_sales, filter for the year 2015 and make a bar chart which shows the total_sales for all cities. Highlight the top 5

Creating interactive Visualizations - ggiraph()

Content for this topic has been sourced from Albert Rapp’s work. Please check out the work for detailed information.

  • Until now, we have learnt many ways of making effective visualizations
  • Adding interactivity to our plots makes it easy for people to focus on what they find important
  • ggiraph helps turn our graphs into interactive visualizations
  • It can be combined with many other features to connect different visualizations and communicate data-driven insights
  • Run the following code for data preparation
gapminder_data <- gapminder::gapminder %>%
  janitor::clean_names() %>%
  mutate(id = levels(continent)[as.numeric(continent)],
    continent = forcats::fct_reorder(continent, life_exp))

mean_life_exps <- gapminder_data %>% 
  group_by(continent, year, id) %>%
  summarise(mean_life_exp = mean(life_exp)) %>%
  ungroup()

Creating interactive Visualizations - ggiraph()

Content for this topic has been sourced from Albert Rapp’s work. Please check out the work for detailed information.

  • Let us make a simple line chart first
line_chart <- mean_life_exps %>%
  ggplot(aes(x = year, y = mean_life_exp, col = continent)) +
  geom_line(linewidth = 2.5) +
  geom_point(size = 3.5) +
  theme_minimal(base_size = 18) +
  labs(x = element_blank(),
    y = 'Life expectancy (in years)',
    title = 'Life expectancy over time')
line_chart

Creating interactive Visualizations - ggiraph()

Content for this topic has been sourced from Albert Rapp’s work. Please check out the work for detailed information.

  • We want to make geom_point() and geom_line() interactive
  • This can be done by
    • changing geom_point() to geom_point_interactive()
    • changing geom_line() to geom_line_interactive()
    • adding the data_id aesthetic
line_chart <- mean_life_exps %>%
  ggplot(aes(x = year, y = mean_life_exp, col = continent, data_id = id)) +
  geom_line_interactive(linewidth = 2.5) +
  geom_point_interactive(size = 3.5) +
  theme_minimal(base_size = 18) +
  labs(x = element_blank(),
    y = 'Life expectancy (in years)',
    title = 'Life expectancy over time')
line_chart

girafe(ggobj = line_chart)

Creating interactive Visualizations - ggiraph()

Content for this topic has been sourced from Albert Rapp’s work. Please check out the work for detailed information.

  • Currently, when you hover over a line, it turns orange
  • We want to set our interactivity in such a way that when we hover over a line, all other lines fade away
  • This can be done by passing hover options to our interactive chart
girafe(ggobj = line_chart,
  options = list(
    opts_hover(css = ''), ## CSS code of line we're hovering over
    opts_hover_inv(css = "opacity:0.1;"), ## CSS code of all other lines
    opts_sizing(rescale = FALSE)),
  height_svg = 6,
  width_svg = 9)
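Besides data_id, ggiraph's interactive geoms accept a tooltip aesthetic that shows text when the pointer is over an element. A minimal sketch, assuming you want the country name as the tooltip in a 2007 scatterplot; it uses the original gapminder column names (lifeExp, pop) rather than the cleaned ones, and the names gap_2007, p_tooltip and widget are illustrative.

```r
library(ggplot2)
library(ggiraph)
library(gapminder)

# Keep a single year so each country appears once
gap_2007 <- gapminder[gapminder$year == 2007, ]

p_tooltip <- ggplot(gap_2007,
                    aes(x = lifeExp, y = pop, colour = continent,
                        # Text shown on hover, and the id that links elements
                        tooltip = as.character(country),
                        data_id = as.character(continent))) +
  geom_point_interactive(size = 3) +
  theme_minimal()

# Fade everything except the continent under the pointer
widget <- girafe(ggobj = p_tooltip,
                 options = list(opts_hover(css = ""),
                                opts_hover_inv(css = "opacity:0.1;")))
widget
```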

Creating interactive Visualizations - ggiraph()

Content for this topic has been sourced from Albert Rapp’s work. Please check out the work for detailed information.

  • Next, let us try to make an interactive scatterplot using gapminder_data
  • We want to make a scatterplot of life expectancy and population across continents for the year 2007
  • First, we make the interactive scatterplot
scatterplot_graph <- gapminder_data %>%
  filter(year == 2007) %>%
  ggplot(aes(x = life_exp, y = pop, fill = continent, data_id = id)) +
  geom_point_interactive(aes(col = continent), size = 3) +
  theme_minimal(base_size = 18) +
  labs(y = 'Population',
    title = 'Life expectancy vs Population in 2007')

scatterplot_graph

  • Next, to make all other continents’ points fade away when we hover over one continent’s points, we use hover options
girafe(ggobj = scatterplot_graph,
  options = list(opts_hover(css = ''),
    opts_hover_inv(css = "opacity:0.1;"),
    opts_sizing(rescale = FALSE)),
  height_svg = 6,
  width_svg = 9)

Combining ggiraph and patchwork Together

Content for this topic has been sourced from Albert Rapp’s work. Please check out the work for detailed information.

  • You can combine two interactive charts using patchwork to show the interaction of different variables at the same time
  • However, this can only be done when the underlying data_id is the same
girafe(
  ggobj = scatterplot_graph + plot_spacer() + line_chart + plot_layout(widths = c(0.45, 0.1, 0.45)),
  options = list(opts_hover(css = ''),
    opts_hover_inv(css = "opacity:0.1;"), 
    opts_sizing(rescale = FALSE)),
  height_svg = 8,
  width_svg = 12)